Conversation

@lucaslie
Member

@lucaslie lucaslie commented Oct 30, 2025

Description

  • Configurable cache config via YAML or CLI args
  • Utility to merge the cache config from the user and the factory (user takes precedence); a sketch is shown below
  • Fix in the Triton Mamba kernel to ensure the SSM state dtype is respected
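
A minimal sketch of the user-over-factory merge behavior (the function name and dict shapes are illustrative, not the PR's actual helper):

from typing import Any, Dict

def merge_cache_config(factory: Dict[str, Any], user: Dict[str, Any]) -> Dict[str, Any]:
    """Recursively merge cache-config dicts; user-provided values override factory defaults."""
    merged = dict(factory)
    for key, user_val in user.items():
        if isinstance(merged.get(key), dict) and isinstance(user_val, dict):
            merged[key] = merge_cache_config(merged[key], user_val)
        else:
            merged[key] = user_val
    return merged

# The user's mamba_dtype overrides the factory default; untouched keys are kept.
factory_cfg = {"cache_config": {"dtype": "bfloat16", "mamba_dtype": None}}
user_cfg = {"cache_config": {"mamba_dtype": "float32"}}
merged = merge_cache_config(factory_cfg, user_cfg)
assert merged["cache_config"] == {"dtype": "bfloat16", "mamba_dtype": "float32"}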

The Mamba cache can be configured via:

args:
  transforms:
    insert_cached_ssm_attention:
      cache_config:
        mamba_dtype: float32

or

--args.transforms.insert_cached_ssm_attention.cache_config.mamba_dtype=float32

For extra_llm_args in trtllm-bench or trtllm-serve, remove the args prefix.
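
For example, an extra-options YAML file passed to trtllm-bench or trtllm-serve (the filename is illustrative) could look like:

# extra_options.yaml (hypothetical filename)
transforms:
  insert_cached_ssm_attention:
    cache_config:
      mamba_dtype: float32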

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provide a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purposes. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.
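
For instance, a hypothetical invocation that combines several of the options above (the stage name is only an example taken from this help text):

/bot run --disable-fail-fast --extra-stage "H100_PCIe-TensorRT-Post-Merge-1"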

kill

kill

Kill all running builds associated with the pull request.

skip

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous, since a lack of user care and validation can cause the top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous, since a lack of user care and validation can cause the top of tree to break.

@lucaslie lucaslie requested a review from a team as a code owner October 30, 2025 20:56
@lucaslie lucaslie requested a review from MrGeva October 30, 2025 20:56
@lucaslie lucaslie self-assigned this Oct 30, 2025
@lucaslie lucaslie moved this from Backlog to In review in AutoDeploy Board Oct 30, 2025
@lucaslie lucaslie linked an issue Oct 30, 2025 that may be closed by this pull request
@coderabbitai
Contributor

coderabbitai bot commented Oct 30, 2025

📝 Walkthrough

Walkthrough

These changes extend the Mamba SSM caching infrastructure by introducing a dedicated dtype configuration option, updating metadata tuple signatures to include additional state information, and refactoring the Triton backend to inherit from the Torch backend implementation to eliminate duplicate interface methods.

Changes

  • Cache Configuration Enhancement (tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py): Added an optional mamba_dtype field to the CacheConfig dataclass to support dtype-specific caching logic for Mamba operations.
  • Torch Backend Mamba Updates (tensorrt_llm/_torch/auto_deploy/custom_ops/mamba/torch_backend_mamba.py): Updated get_prepare_metadata_op to return a 4-tuple that includes a use_initial_states flag. Modified get_cache_initializers to prefer cache_config.mamba_dtype when set, falling back to the dtype derived from the source node metadata for SSM cache construction.
  • Triton Backend Refactoring (tensorrt_llm/_torch/auto_deploy/custom_ops/mamba/triton_backend_mamba.py): Changed TritonBackendSSM inheritance from AttentionDescriptor to TorchBackendSSM. Removed 8 duplicate methods (is_paged, get_attention_layout, get_num_qkv_args, get_source_attention_op, get_prepare_metadata_op, get_cache_initializers, get_global_buffer_initializers, get_constants). Retained only get_cached_attention_op as the public interface and updated imports accordingly.

Sequence Diagram(s)

sequenceDiagram
    participant Caller
    participant TorchBackendSSM as get_cache_initializers<br/>(Torch Backend)
    participant NodeMeta as source_attn_node<br/>.meta["val"]
    participant CacheConfig
    participant SSMCache as ssm_state_cache

    Caller->>TorchBackendSSM: get_cache_initializers(source_attn_node, cache_config)
    TorchBackendSSM->>NodeMeta: Extract dtype
    NodeMeta-->>TorchBackendSSM: dtype (from node metadata)
    TorchBackendSSM->>CacheConfig: Check cache_config.mamba_dtype
    alt mamba_dtype available
        CacheConfig-->>TorchBackendSSM: mamba_dtype
        TorchBackendSSM->>SSMCache: Create with mamba_dtype
    else mamba_dtype not set
        CacheConfig-->>TorchBackendSSM: None
        TorchBackendSSM->>SSMCache: Use node dtype as fallback
    end
    SSMCache-->>Caller: Cache with resolved dtype
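
A minimal Python sketch of the resolution order shown above (the function and argument names are illustrative, not the actual get_cache_initializers signature):

from typing import Optional

import torch

def resolve_ssm_cache_dtype(
    node_dtype: torch.dtype,
    mamba_dtype: Optional[torch.dtype] = None,
) -> torch.dtype:
    # Prefer an explicitly configured cache dtype; otherwise fall back to the
    # dtype recorded in the source attention node's metadata.
    return mamba_dtype if mamba_dtype is not None else node_dtype

# Example: keep the SSM state cache in fp32 even for a bf16 model.
assert resolve_ssm_cache_dtype(torch.bfloat16, torch.float32) == torch.float32
assert resolve_ssm_cache_dtype(torch.bfloat16) == torch.bfloat16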

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

  • triton_backend_mamba.py — Requires careful verification that inheriting from TorchBackendSSM correctly replaces all 8 removed methods and that the public API contract is maintained. Ensure no unintended behavioral changes from inheritance-based method resolution.
  • torch_backend_mamba.py — Verify dtype resolution logic correctly prioritizes cache_config.mamba_dtype over node dtype, and confirm the expanded 4-tuple return from get_prepare_metadata_op is handled correctly by all callers.
  • attention_interface.py — Confirm backward compatibility; the optional field should not affect existing code paths that don't use it.

Pre-merge checks

❌ Failed checks (2 warnings)
  • Docstring Coverage (⚠️ Warning): Docstring coverage is 0.00%, which is below the required threshold of 80.00%. Resolution: run @coderabbitai generate docstrings to improve docstring coverage.
  • Description check (⚠️ Warning): The PR description lacks proper structure; concrete implementation details are missing under the Description and Test Coverage sections. Resolution: fill in the Description section with a clear explanation of the issue and solution, and the Test Coverage section with relevant test cases that validate the changes.
✅ Passed checks (1 passed)
  • Title check (✅ Passed): The title '[#8763][feature] AutoDeploy: configurable dtype for caching' directly matches the main code changes, which introduce an optional mamba_dtype field for configurable cache dtype handling across the attention and mamba backend implementations.


@suyoggupta
Collaborator

what's the status of this PR? @lucaslie

@lucaslie lucaslie force-pushed the ll/fp32_mamba_cache branch from 518086c to 45f971e on November 10, 2025 19:14
@lucaslie lucaslie changed the title [#8763][fix] AutoDeploy: correct mamba cache dtype extraction [#8763][feature] AutoDeploy: configurable dtype for caching Nov 10, 2025
@lucaslie
Member Author

/bot run

@lucaslie lucaslie requested a review from 2ez4bz November 10, 2025 19:20
@tensorrt-cicd
Collaborator

PR_Github #24045 [ run ] triggered by Bot. Commit: 45f971e

@tensorrt-cicd
Collaborator

PR_Github #24045 [ run ] completed with state SUCCESS. Commit: 45f971e
/LLM/main/L0_MergeRequest_PR pipeline #18119 completed with status: 'FAILURE'

Member Author

@lucaslie lucaslie left a comment


There is a perf regression that I am still looking into

edit: re-ran the benchmark and looks good, see results below: #8812 (comment)

@lucaslie
Member Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #24062 [ run ] triggered by Bot. Commit: 45f971e

@lucaslie
Member Author

lucaslie commented Nov 11, 2025

Bf16 pre:

===========================================================
= PERFORMANCE OVERVIEW 
===========================================================
Request Throughput (req/sec):                     2.8041
Total Output Throughput (tokens/sec):             2871.3826
Total Token Throughput (tokens/sec):              5742.7653
Total Latency (ms):                               91295.3912
Average request latency (ms):                     59847.7003
Per User Output Throughput [w/ ctx] (tps/user):   19.4070
Per GPU Output Throughput (tps/gpu):              2871.3826

Bf16 post:

===========================================================
= PERFORMANCE OVERVIEW 
===========================================================
Request Throughput (req/sec):                     2.8026
Total Output Throughput (tokens/sec):             2869.8821
Total Token Throughput (tokens/sec):              5739.7642
Total Latency (ms):                               91343.1245
Average request latency (ms):                     59720.5701
Per User Output Throughput [w/ ctx] (tps/user):   19.4566
Per GPU Output Throughput (tps/gpu):              2869.8821

Bf16 post with fp32 cache

OOM on TP=1 with consistent settings

FP8 Pre:

===========================================================
= PERFORMANCE OVERVIEW 
===========================================================
Request Throughput (req/sec):                     5.0824
Total Output Throughput (tokens/sec):             5204.3830
Total Token Throughput (tokens/sec):              10408.7660
Total Latency (ms):                               50369.8519
Average request latency (ms):                     49102.8390
Per User Output Throughput [w/ ctx] (tps/user):   20.8603
Per GPU Output Throughput (tps/gpu):              5204.3830

FP8 Post:

===========================================================
= PERFORMANCE OVERVIEW 
===========================================================
Request Throughput (req/sec):                     5.0757
Total Output Throughput (tokens/sec):             5197.5544
Total Token Throughput (tokens/sec):              10395.1089
Total Latency (ms):                               50436.0278
Average request latency (ms):                     49248.4491
Per User Output Throughput [w/ ctx] (tps/user):   20.7990
Per GPU Output Throughput (tps/gpu):              5197.5544

FP8 Post with fp32 cache

===========================================================
= PERFORMANCE OVERVIEW 
===========================================================
Request Throughput (req/sec):                     4.9643
Total Output Throughput (tokens/sec):             5083.4690
Total Token Throughput (tokens/sec):              10166.9380
Total Latency (ms):                               51567.9353
Average request latency (ms):                     50402.6484
Per User Output Throughput [w/ ctx] (tps/user):   20.3229
Per GPU Output Throughput (tps/gpu):              5083.4690

@suyoggupta
Collaborator

Can we check in a nano_v3.yaml to the repo so that we have a version-controlled source of truth for the model config?

@tensorrt-cicd
Collaborator

PR_Github #24062 [ run ] completed with state SUCCESS. Commit: 45f971e
/LLM/main/L0_MergeRequest_PR pipeline #18134 completed with status: 'FAILURE'

@lucaslie lucaslie requested a review from a team as a code owner November 11, 2025 02:04
Signed-off-by: Lucas Liebenwein <[email protected]>
Signed-off-by: Lucas Liebenwein <[email protected]>
@lucaslie lucaslie force-pushed the ll/fp32_mamba_cache branch from d8e8339 to 10cb1b3 on November 11, 2025 02:07
@lucaslie
Member Author

/bot run

@suyoggupta
Collaborator

can you please also post perf with mamba cache set to fp32?

@lucaslie
Member Author

lucaslie commented Nov 11, 2025

can you please also post perf with mamba cache set to fp32?

updated the perf comment with TP=1, fp32 cache, fp8 checkpoint. Same settings run OOM for bf16 with fp32 cache. Do you want any other perf measurements?

@tensorrt-cicd
Collaborator

PR_Github #24078 [ run ] triggered by Bot. Commit: 10cb1b3

@suyoggupta
Collaborator

thanks for adding this. Won't ask for more :)

@tensorrt-cicd
Collaborator

PR_Github #24078 [ run ] completed with state SUCCESS. Commit: 10cb1b3
/LLM/main/L0_MergeRequest_PR pipeline #18145 completed with status: 'SUCCESS'

@lucaslie lucaslie merged commit 6bf4e59 into NVIDIA:main Nov 11, 2025
5 checks passed
@github-project-automation github-project-automation bot moved this from In review to Done in AutoDeploy Board Nov 11, 2025
@lucaslie lucaslie deleted the ll/fp32_mamba_cache branch November 11, 2025 06:17
@lucaslie lucaslie restored the ll/fp32_mamba_cache branch November 11, 2025 08:23
@lucaslie lucaslie deleted the ll/fp32_mamba_cache branch November 11, 2025 08:29
suyoggupta pushed a commit to nv-auto-deploy/TensorRT-LLM that referenced this pull request Nov 12, 2025

Development

Successfully merging this pull request may close these issues.

[Feature]: AutoDeploy: Allow for specifying mamba cache dtype
